Package imports

In [1]:
#%pip install dataprep
#%pip install pandas
#%pip install matplotlib
#%pip install scikit-learn
#%pip install plotly
#%pip install xgboost
#%pip install category_encoders
#%pip install scikit-learn-intelex
#%pip install kaleido
In [2]:
import plotly.io as pio
import plotly.graph_objects as go

#pio.renderers.default='notebook'
pio.renderers.default='svg'
layout = {
  'width': 1410,
  'height': 525
}
from sklearnex import patch_sklearn
patch_sklearn()

import warnings
warnings.filterwarnings('ignore')
import logging
logging.disable(logging.INFO)
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
In [3]:
from IPython.display import display
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
from dataprep.eda import create_report

# models
from sklearn import linear_model
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import plot_tree, DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from tensorflow import keras
from tensorflow.keras import layers

# preprocessing, model selection and inspection
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.inspection import PartialDependenceDisplay
from category_encoders import CatBoostEncoder, CountEncoder

# metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import mean_absolute_percentage_error as mape
from sklearn.metrics import make_scorer

Data import

In [4]:
df = pd.read_csv('iml1_unidade1_dados.csv')

Descriptive Analysis

In [5]:
display(df.info())
display(df.head(4))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long           21613 non-null  float64
 19  sqft_living15  21613 non-null  int64  
 20  sqft_lot15     21613 non-null  int64  
dtypes: float64(5), int64(15), object(1)
memory usage: 3.5+ MB
None
           id             date     price  bedrooms  bathrooms  sqft_living  sqft_lot  floors  waterfront  view  ...  grade  sqft_above  sqft_basement  yr_built  yr_renovated  zipcode      lat      long  sqft_living15  sqft_lot15
0  7129300520  20141013T000000  221900.0         3       1.00         1180      5650     1.0           0     0  ...      7        1180              0      1955             0    98178  47.5112  -122.257           1340        5650
1  6414100192  20141209T000000  538000.0         3       2.25         2570      7242     2.0           0     0  ...      7        2170            400      1951          1991    98125  47.7210  -122.319           1690        7639
2  5631500400  20150225T000000  180000.0         2       1.00          770     10000     1.0           0     0  ...      6         770              0      1933             0    98028  47.7379  -122.233           2720        8062
3  2487200875  20141209T000000  604000.0         4       3.00         1960      5000     1.0           0     0  ...      7        1050            910      1965             0    98136  47.5208  -122.393           1360        5000

4 rows × 21 columns

Column descriptions:

  • date: sale date.
  • price: sale price (the target variable).
  • bedrooms: number of bedrooms.
  • bathrooms: number of bathrooms.
  • sqft_living: interior living area (square feet).
  • sqft_lot: lot area (square feet).
  • floors: number of floors.
  • waterfront: binary variable (0 or 1) indicating whether the property has a waterfront view.
  • view: index from 0 to 4 rating how good the view is.
  • condition: condition of the property (from 1 to 5).
  • grade: index from 1 to 13 rating construction and design quality.
  • sqft_above: built area above ground level.
  • sqft_basement: built area below ground level.
  • yr_built: year of construction.
  • yr_renovated: year of renovation.
  • zipcode: ZIP code.
  • lat: latitude.
  • long: longitude.
  • sqft_living15: interior living area of the 15 nearest neighbors.
  • sqft_lot15: lot area of the 15 nearest neighbors.
In [6]:
# Count waterfront vs. non-waterfront properties.
df['waterfront'].value_counts()
Out[6]:
0    21450
1      163
Name: waterfront, dtype: int64
In [7]:
create_report(df)
Out[7]:
DataPrep Report

Overview

Dataset Statistics

Number of Variables 21
Number of Rows 21613
Missing Cells 0
Missing Cells (%) 0.0%
Duplicate Rows 0
Duplicate Rows (%) 0.0%
Total Size in Memory 4.8 MB
Average Row Size in Memory 232.0 B
Variable Types
  • Numerical: 16
  • Categorical: 5

Dataset Insights

sqft_lot and sqft_lot15 have similar distributions Similar Distribution
price is skewed Skewed
bedrooms is skewed Skewed
bathrooms is skewed Skewed
sqft_living is skewed Skewed
sqft_lot is skewed Skewed
grade is skewed Skewed
sqft_above is skewed Skewed
sqft_basement is skewed Skewed
yr_renovated is skewed Skewed
sqft_lot15 is skewed Skewed
date has a high cardinality: 372 distinct values High Cardinality
date has constant length 15 Constant Length
floors has constant length 3 Constant Length
waterfront has constant length 1 Constant Length
view has constant length 1 Constant Length
condition has constant length 1 Constant Length
long has 21613 (100.0%) negatives Negatives
sqft_basement has 13126 (60.73%) zeros Zeros
yr_renovated has 20699 (95.77%) zeros Zeros

Variables


id

numerical

Approximate Distinct Count 21436
Approximate Unique (%) 99.2%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 345808
Mean 4.5803e+09
Minimum 1000102
Maximum 9900000190
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • id is skewed right (γ1 = 0.2433)

Quantile Statistics

Minimum 1000102
5-th Percentile 5.1248e+08
Q1 2.123e+09
Median 3.9049e+09
Q3 7.3089e+09
95-th Percentile 9.2973e+09
Maximum 9900000190
Range 9899000088
IQR 5.1859e+09

Descriptive Statistics

Mean 4.5803e+09
Standard Deviation 2.8766e+09
Variance 8.2746e+18
Sum 9.8994e+13
Skewness 0.2433
Kurtosis -1.2605
Coefficient of Variation 0.628

date

categorical

Approximate Distinct Count 372
Approximate Unique (%) 1.7%
Missing 0
Missing (%) 0.0%
Memory Size 1729040

Length

Mean 15
Standard Deviation 0
Median 15
Minimum 15
Maximum 15

Sample

1st row 20141013T000000
2nd row 20141209T000000
3rd row 20150225T000000
4th row 20141209T000000
5th row 20150218T000000

Letter

Count 21613
Lowercase Letter 0
Space Separator 0
Uppercase Letter 21613
Dash Punctuation 0
Decimal Number 302582
  • date has words of constant length

price

numerical

Approximate Distinct Count 4028
Approximate Unique (%) 18.6%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 345808
Mean 540088.1418
Minimum 75000
Maximum 7.7e+06
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • price is skewed right (γ1 = 4.0238)

Quantile Statistics

Minimum 75000
5-th Percentile 210000
Q1 321950
Median 450000
Q3 645000
95-th Percentile 1.1565e+06
Maximum 7.7e+06
Range 7.625e+06
IQR 323050

Descriptive Statistics

Mean 540088.1418
Standard Deviation 367127.1965
Variance 1.3478e+11
Sum 1.1673e+10
Skewness 4.0238
Kurtosis 34.5773
Coefficient of Variation 0.6798
  • price is not normally distributed (p-value 1.2936389939954236e-14)
  • price has 1146 outliers

bedrooms

numerical

Approximate Distinct Count 13
Approximate Unique (%) 0.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 345808
Mean 3.3708
Minimum 0
Maximum 33
Zeros 13
Zeros (%) 0.1%
Negatives 0
Negatives (%) 0.0%
  • bedrooms is skewed right (γ1 = 1.9742)

Quantile Statistics

Minimum 0
5-th Percentile 2
Q1 3
Median 3
Q3 4
95-th Percentile 5
Maximum 33
Range 33
IQR 1

Descriptive Statistics

Mean 3.3708
Standard Deviation 0.9301
Variance 0.865
Sum 72854
Skewness 1.9742
Kurtosis 49.052
Coefficient of Variation 0.2759
  • bedrooms is not normally distributed (p-value 3.0178300088703286e-18)
  • bedrooms has 546 outliers

bathrooms

numerical

Approximate Distinct Count 30
Approximate Unique (%) 0.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 345808
Mean 2.1148
Minimum 0
Maximum 8
Zeros 10
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • bathrooms is skewed right (γ1 = 0.5111)

Quantile Statistics

Minimum 0
5-th Percentile 1
Q1 1.5
Median 2.25
Q3 2.5
95-th Percentile 3.5
Maximum 8
Range 8
IQR 1

Descriptive Statistics

Mean 2.1148
Standard Deviation 0.7702
Variance 0.5932
Sum 45706.25
Skewness 0.5111
Kurtosis 1.2793
Coefficient of Variation 0.3642
  • bathrooms is not normally distributed (p-value 8.359028103350753e-13)
  • bathrooms has 266 outliers

sqft_living

numerical

Approximate Distinct Count 1038
Approximate Unique (%) 4.8%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 345808
Mean 2079.8997
Minimum 290
Maximum 13540
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • sqft_living is skewed right (γ1 = 1.4715)

Quantile Statistics

Minimum 290
5-th Percentile 940
Q1 1427
Median 1910
Q3 2550
95-th Percentile 3760
Maximum 13540
Range 13250
IQR 1123

Descriptive Statistics

Mean 2079.8997
Standard Deviation 918.4409
Variance 843533.6814
Sum 4.4953e+07
Skewness 1.4715
Kurtosis 5.2416
Coefficient of Variation 0.4416
  • sqft_living is not normally distributed (p-value 8.007676846434566e-07)
  • sqft_living has 572 outliers

sqft_lot

numerical

Approximate Distinct Count 9782
Approximate Unique (%) 45.3%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 345808
Mean 15106.9676
Minimum 520
Maximum 1651359
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • sqft_lot is skewed right (γ1 = 13.0591)

Quantile Statistics

Minimum 520
5-th Percentile 1800
Q1 5040
Median 7618
Q3 10688
95-th Percentile 43339.2
Maximum 1651359
Range 1650839
IQR 5648

Descriptive Statistics

Mean 15106.9676
Standard Deviation 41420.5115
Variance 1.7157e+09
Sum 3.2651e+08
Skewness 13.0591
Kurtosis 285.0116
Coefficient of Variation 2.7418
  • sqft_lot is not normally distributed (p-value 4.751116629961859e-25)
  • sqft_lot has 2425 outliers

floors

categorical

Approximate Distinct Count 6
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 1469684

Length

Mean 3
Standard Deviation 0
Median 3
Minimum 3
Maximum 3

Sample

1st row 1.0
2nd row 2.0
3rd row 1.0
4th row 1.0
5th row 1.0

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 43226
  • The top 2 categories (1.0, 2.0) take over 50.0%
  • floors has words of constant length

waterfront

categorical

Approximate Distinct Count 2
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 1426458
  • The largest value (0) is over 131.6 times larger than the second largest value (1)

Length

Mean 1
Standard Deviation 0
Median 1
Minimum 1
Maximum 1

Sample

1st row 0
2nd row 0
3rd row 0
4th row 0
5th row 0

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 21613
  • The top 2 categories (0, 1) take over 50.0%
  • waterfront has words of constant length

view

categorical

Approximate Distinct Count 5
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 1426458
  • The largest value (0) is over 20.24 times larger than the second largest value (2)

Length

Mean 1
Standard Deviation 0
Median 1
Minimum 1
Maximum 1

Sample

1st row 0
2nd row 0
3rd row 0
4th row 0
5th row 0

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 21613
  • The top 2 categories (0, 2) take over 50.0%
  • view has words of constant length

condition

categorical

Approximate Distinct Count 5
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 1426458
  • The largest value (3) is over 2.47 times larger than the second largest value (4)

Length

Mean 1
Standard Deviation 0
Median 1
Minimum 1
Maximum 1

Sample

1st row 3
2nd row 3
3rd row 3
4th row 5
5th row 3

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 21613
  • The top 2 categories (3, 4) take over 50.0%
  • condition has words of constant length

grade

numerical

Approximate Distinct Count 12
Approximate Unique (%) 0.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 345808
Mean 7.6569
Minimum 1
Maximum 13
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • grade is skewed right (γ1 = 0.771)

Quantile Statistics

Minimum 1
5-th Percentile 6
Q1 7
Median 7
Q3 8
95-th Percentile 10
Maximum 13
Range 12
IQR 1

Descriptive Statistics

Mean 7.6569
Standard Deviation 1.1755
Variance 1.3817
Sum 165488
Skewness 0.771
Kurtosis 1.1904
Coefficient of Variation 0.1535
  • grade is not normally distributed (p-value 8.185491070523821e-18)
  • grade has 1911 outliers

sqft_above

numerical

Approximate Distinct Count 946
Approximate Unique (%) 4.4%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 345808
Mean 1788.3907
Minimum 290
Maximum 9410
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • sqft_above is skewed right (γ1 = 1.4466)

Quantile Statistics

Minimum 290
5-th Percentile 850
Q1 1190
Median 1560
Q3 2210
95-th Percentile 3400
Maximum 9410
Range 9120
IQR 1020

Descriptive Statistics

Mean 1788.3907
Standard Deviation 828.091
Variance 685734.6673
Sum 3.8652e+07
Skewness 1.4466
Kurtosis 3.4012
Coefficient of Variation 0.463
  • sqft_above is not normally distributed (p-value 5.712255190179444e-07)
  • sqft_above has 611 outliers

sqft_basement

numerical

Approximate Distinct Count 306
Approximate Unique (%) 1.4%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 345808
Mean 291.509
Minimum 0
Maximum 4820
Zeros 13126
Zeros (%) 60.7%
Negatives 0
Negatives (%) 0.0%
  • sqft_basement is skewed right (γ1 = 1.5779)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0
Median 0
Q3 560
95-th Percentile 1190
Maximum 4820
Range 4820
IQR 560

Descriptive Statistics

Mean 291.509
Standard Deviation 442.575
Variance 195872.6684
Sum 6.3004e+06
Skewness 1.5779
Kurtosis 2.7147
Coefficient of Variation 1.5182
  • sqft_basement is not normally distributed (p-value 1.4557358841725596e-24)
  • sqft_basement has 496 outliers

yr_built

numerical

Approximate Distinct Count 116
Approximate Unique (%) 0.5%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 345808
Mean 1971.0051
Minimum 1900
Maximum 2015
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • yr_built is skewed left (γ1 = -0.4698)

Quantile Statistics

Minimum 1900
5-th Percentile 1915
Q1 1951
Median 1975
Q3 1997
95-th Percentile 2011
Maximum 2015
Range 115
IQR 46

Descriptive Statistics

Mean 1971.0051
Standard Deviation 29.3734
Variance 862.7973
Sum 4.2599e+07
Skewness -0.4698
Kurtosis -0.6575
Coefficient of Variation 0.0149

yr_renovated

numerical

Approximate Distinct Count 70
Approximate Unique (%) 0.3%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 345808
Mean 84.4023
Minimum 0
Maximum 2015
Zeros 20699
Zeros (%) 95.8%
Negatives 0
Negatives (%) 0.0%
  • yr_renovated is skewed right (γ1 = 4.5492)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0
Median 0
Q3 0
95-th Percentile 0
Maximum 2015
Range 2015
IQR 0

Descriptive Statistics

Mean 84.4023
Standard Deviation 401.6792
Variance 161346.2119
Sum 1.8242e+06
Skewness 4.5492
Kurtosis 18.6965
Coefficient of Variation 4.7591
  • yr_renovated is not normally distributed (p-value 4.609893220146306e-25)
  • yr_renovated has 914 outliers

zipcode

numerical

Approximate Distinct Count 70
Approximate Unique (%) 0.3%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 345808
Mean 98077.9398
Minimum 98001
Maximum 98199
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • zipcode is skewed right (γ1 = 0.4056)

Quantile Statistics

Minimum 98001
5-th Percentile 98004
Q1 98033
Median 98065
Q3 98118
95-th Percentile 98177
Maximum 98199
Range 198
IQR 85

Descriptive Statistics

Mean 98077.9398
Standard Deviation 53.505
Variance 2862.7878
Sum 2.1198e+09
Skewness 0.4056
Kurtosis -0.8536
Coefficient of Variation 0.00054554

lat

numerical

Approximate Distinct Count 5034
Approximate Unique (%) 23.3%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 345808
Mean 47.5601
Minimum 47.1559
Maximum 47.7776
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • lat is skewed left (γ1 = -0.4852)

Quantile Statistics

Minimum 47.1559
5-th Percentile 47.3103
Q1 47.471
Median 47.5718
Q3 47.678
95-th Percentile 47.7496
Maximum 47.7776
Range 0.6217
IQR 0.207

Descriptive Statistics

Mean 47.5601
Standard Deviation 0.1386
Variance 0.0192
Sum 1.0279e+06
Skewness -0.4852
Kurtosis -0.6764
Coefficient of Variation 0.002913
  • lat has 2 outliers

long

numerical

Approximate Distinct Count 752
Approximate Unique (%) 3.5%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 345808
Mean -122.2139
Minimum -122.519
Maximum -121.315
Zeros 0
Zeros (%) 0.0%
Negatives 21613
Negatives (%) 100.0%
  • long is skewed right (γ1 = 0.885)

Quantile Statistics

Minimum -122.519
5-th Percentile -122.387
Q1 -122.328
Median -122.23
Q3 -122.125
95-th Percentile -121.979
Maximum -121.315
Range 1.204
IQR 0.203

Descriptive Statistics

Mean -122.2139
Standard Deviation 0.1408
Variance 0.01983
Sum -2.6414e+06
Skewness 0.885
Kurtosis 1.049
Coefficient of Variation -0.001152
  • long is not normally distributed (p-value 0.0025777370896867095)
  • long has 256 outliers

sqft_living15

numerical

Approximate Distinct Count 777
Approximate Unique (%) 3.6%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 345808
Mean 1986.5525
Minimum 399
Maximum 6210
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • sqft_living15 is skewed right (γ1 = 1.1081)

Quantile Statistics

Minimum 399
5-th Percentile 1140
Q1 1490
Median 1840
Q3 2360
95-th Percentile 3300
Maximum 6210
Range 5811
IQR 870

Descriptive Statistics

Mean 1986.5525
Standard Deviation 685.3913
Variance 469761.2399
Sum 4.2935e+07
Skewness 1.1081
Kurtosis 1.5964
Coefficient of Variation 0.345
  • sqft_living15 is not normally distributed (p-value 0.002310571902041566)
  • sqft_living15 has 544 outliers

sqft_lot15

numerical

Approximate Distinct Count 8689
Approximate Unique (%) 40.2%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 345808
Mean 12768.4557
Minimum 651
Maximum 871200
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • sqft_lot15 is skewed right (γ1 = 9.5061)

Quantile Statistics

Minimum 651
5-th Percentile 1999.2
Q1 5100
Median 7620
Q3 10083
95-th Percentile 37062.8
Maximum 871200
Range 870549
IQR 4983

Descriptive Statistics

Mean 12768.4557
Standard Deviation 27304.1796
Variance 7.4552e+08
Sum 2.7596e+08
Skewness 9.5061
Kurtosis 150.728
Coefficient of Variation 2.1384
  • sqft_lot15 is not normally distributed (p-value 4.9922979195239255e-25)
  • sqft_lot15 has 2194 outliers

Report generated with DataPrep

Exploratory Analysis

In [8]:
# Price quantiles, using the closest lower observed value at each level
df['price'].quantile([
    0.0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5,
    0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.97, 0.99],
    interpolation='lower')
Out[8]:
0.00      75000.0
0.05     210000.0
0.10     245000.0
0.15     270000.0
0.20     298450.0
0.25     321950.0
0.30     345000.0
0.35     370000.0
0.40     399500.0
0.45     425000.0
0.50     450000.0
0.55     482000.0
0.60     519000.0
0.65     550000.0
0.70     595000.0
0.75     645000.0
0.80     700000.0
0.85     779380.0
0.90     887000.0
0.95    1156000.0
0.97    1388000.0
0.99    1960000.0
Name: price, dtype: float64
In [9]:
# Price by number of bedrooms
px.scatter(df, x='bedrooms', y='price', color='waterfront', **layout)
[Figure: scatter plot of price vs. bedrooms, colored by waterfront]
  • An outlier with 33 bedrooms is visible.
In [10]:
# Remove the 33-bedroom outlier
df = df.drop(df[df.bedrooms == 33].index)
px.scatter(df, x='bedrooms', y='price', color='waterfront', **layout)
[Figure: scatter plot of price vs. bedrooms after removing the outlier, colored by waterfront]

Price by number of bathrooms

In [11]:
# Price by number of bathrooms
px.scatter(df, x='bathrooms', y='price', color='waterfront', **layout)
[Figure: scatter plot of price vs. bathrooms, colored by waterfront]

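Price by sale date

In [12]:
# Price by sale date
def toDateTime(df, column):
  # parse the date strings and convert them to Unix timestamps in seconds
  dt = pd.to_datetime(df[column])
  return dt.apply(lambda x: int(x.timestamp()))

df['date'] = toDateTime(df, 'date')
# subtract a constant offset so the timestamps are smaller numbers
df['date'] = df['date']-1300000000
px.scatter(df, x='date', y='price', color='waterfront', **layout)
[Figure: scatter plot of price vs. sale date (offset Unix timestamp), colored by waterfront]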
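
The conversion above yields offset Unix seconds, which are hard to read on an axis. As an alternative sketch (not what this notebook does), the same strings can be parsed with an explicit format and expressed as days since the earliest sale. The frame `sales` and the column `days_since_first_sale` are hypothetical names used only for illustration.

```python
import pandas as pd

# Three date strings in the same layout as the 'date' column of the dataset.
sales = pd.DataFrame({"date": ["20141013T000000", "20141209T000000", "20150225T000000"]})

# An explicit format avoids relying on pandas' format inference.
parsed = pd.to_datetime(sales["date"], format="%Y%m%dT%H%M%S")

# Days elapsed since the earliest sale in the frame.
sales["days_since_first_sale"] = (parsed - parsed.min()).dt.days
print(sales)
```

A day count keeps the ordering and spacing of the sales while staying on a scale comparable to the other numeric columns.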

Create a column with the year of construction or renovation

  • Price by year of construction or renovation
In [13]:
# create the 'year' column holding the more recent of yr_built and yr_renovated
df['year'] = np.where(df['yr_built'] < df['yr_renovated'], df['yr_renovated'], df['yr_built'])
px.scatter(df, x='year', y='price', color='date', **layout)
[Figure: scatter plot of price vs. year, colored by date]
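
The np.where expression above picks, row by row, the larger of the two year columns. Since yr_renovated is 0 in most rows (95.77% zeros in the report), the result falls back to yr_built there. The sketch below, on a small hypothetical frame `homes`, shows that a row-wise maximum gives the same values.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the second home was renovated, the others have yr_renovated == 0.
homes = pd.DataFrame({"yr_built": [1955, 1951, 1933], "yr_renovated": [0, 1991, 0]})

# Expression used in the notebook.
year_where = np.where(homes["yr_built"] < homes["yr_renovated"],
                      homes["yr_renovated"], homes["yr_built"])

# Equivalent row-wise maximum.
year_max = homes[["yr_built", "yr_renovated"]].max(axis=1)

print(year_where, year_max.tolist())
```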

Price by living area

In [14]:
px.scatter(df, x='sqft_living', y='price', color='year', **layout)
[Figure: scatter plot of price vs. sqft_living, colored by year]

Price by lot area

In [15]:
px.scatter(df, x='sqft_lot', y='price', color='year', **layout)
[Figure: scatter plot of price vs. sqft_lot, colored by year]

Price by above-ground area

  • It was verified that sqft_above + sqft_basement == sqft_living for every row.
In [16]:
df.size == df[df.sqft_above + df.sqft_basement == df.sqft_living].size
Out[16]:
True
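
The check above compares element counts before and after filtering. A more direct formulation, sketched here on a small hypothetical frame `homes`, reduces the row-wise comparison with .all() and keeps any mismatching rows available for inspection.

```python
import pandas as pd

# Hypothetical rows mirroring the three area columns of the dataset.
homes = pd.DataFrame({
    "sqft_above": [1180, 2170, 770],
    "sqft_basement": [0, 400, 0],
    "sqft_living": [1180, 2570, 770],
})

# Row-wise identity check.
matches = homes["sqft_above"] + homes["sqft_basement"] == homes["sqft_living"]
is_consistent = bool(matches.all())

# Rows violating the identity, if any.
mismatches = homes[~matches]
print(is_consistent, len(mismatches))
```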
In [17]:
px.scatter(df, x='sqft_above', y='price', **layout)
[Figure: scatter plot of price vs. sqft_above]

Price by basement area

In [18]:
px.scatter(df, x='sqft_basement', y='price', **layout)
[Figure: scatter plot of price vs. sqft_basement]

Price by number of floors

In [19]:
px.scatter(df, x='floors', y='price', **layout)
[Figure: scatter plot of price vs. floors]

Price by latitude

In [20]:
px.scatter(df, x='lat', y='price', **layout)
[Figure: scatter plot of price vs. lat]

Price by longitude

In [21]:
px.scatter(df, x='long', y='price', **layout)
[Figure: scatter plot of price vs. long]

Neighborhood - living area

In [22]:
px.scatter(df, x='sqft_living15', y='price', **layout)
[Figure: scatter plot of price vs. sqft_living15]

Neighborhood - lot area

In [23]:
px.scatter(df, x='sqft_lot15', y='price', **layout)
[Figure: scatter plot of price vs. sqft_lot15]

Re-examine the report output

In [24]:
#create_report(df)

Feature Engineering

Selected features:

  • bedrooms
  • bathrooms
  • sqft_above
  • sqft_basement
  • floors
  • waterfront
  • view
  • condition
  • grade
  • year: the more recent of yr_built and yr_renovated

Features not selected:

  • sqft_living: it is the sum of sqft_above and sqft_basement
  • sqft_lot: appears to have no correlation with price
  • yr_built, yr_renovated: used to derive the 'year' column, which holds the more recent of the two
  • zipcode: appears to have no correlation with price
  • lat, long: appear to have no correlation with price
  • sqft_living15: its correlation with price looks similar to that of sqft_living
  • sqft_lot15: correlated with sqft_lot
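
The "appears to have no correlation" judgments above are read off scatter plots. One way to back them with numbers is to rank columns by their correlation with price. The sketch below uses synthetic data, so the column `unrelated` and all values are assumptions for illustration, not results from this dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins: one column that drives price and one that does not.
living = rng.uniform(500, 4000, n)
frame = pd.DataFrame({
    "sqft_living": living,
    "unrelated": rng.normal(0, 1, n),
    "price": 200 * living + rng.normal(0, 50_000, n),
})

# Pearson correlation of every column with price, strongest first.
corr_with_price = frame.corr()["price"].drop("price")
print(corr_with_price.abs().sort_values(ascending=False))
```

Pearson correlation only captures linear association, so a low value does not rule out a non-linear effect such as the one latitude or ZIP code could have on price.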

Data Split

In [25]:
df_train, df_test = train_test_split(df, test_size=1/10, random_state=0)
In [26]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

myCols = ['bedrooms', 'bathrooms', 'sqft_above', 'sqft_basement', 
          'floors', 'waterfront', 'view', 'condition', 'grade', 'year']
transformer = ColumnTransformer([
  ("selector", "passthrough", myCols)
])
In [27]:
X_train = transformer.fit_transform(df_train)
features = transformer.get_feature_names_out()
X_train = pd.DataFrame(X_train, columns=features)
X_test = pd.DataFrame(transformer.transform(df_test), columns=features)
In [28]:
y_train = df_train['price']
y_test = df_test['price']

Model Selection

In [29]:
# dataframe that stores the results of each model
resultados = pd.DataFrame(columns=['Method', 'MSE', 'MAE', 'MAPE'])
modelosTreinados = []
In [30]:
def plot_coefs(features, coefs, n=20, split_by_sign=True):
    df = pd.DataFrame({'feature': features, 'coef': coefs})

    # filter and sort the coefficients by sign
    coefs_negativos = df.nsmallest(n, 'coef').loc[lambda d: d.coef < 0].sort_values('coef', ascending=False)
    coefs_positivos = df.nlargest(n, 'coef').loc[lambda d: d.coef >= 0].sort_values('coef', ascending=True)
    trace_positivo = go.Bar(
        y=coefs_positivos['feature'], x=coefs_positivos['coef'],
        orientation='h', marker=dict(color='blue'))
    # build the subplots
    if split_by_sign:
        fig = make_subplots(rows=1, cols=2, subplot_titles=('Negative Coefficients', 'Positive Coefficients'))
        fig.add_trace(
            go.Bar(
                y=coefs_negativos['feature'], x=coefs_negativos['coef'],
                orientation='h', marker=dict(color='red'),
                name='Negative Coefficients'),
            row=1, col=1)
        fig.add_trace(trace_positivo, row=1, col=2)
        width = 1200
    else:
        fig = make_subplots(rows=1, cols=1, subplot_titles=('Coefficients',))
        fig.add_trace(trace_positivo, row=1, col=1)
        width = 600
    fig.update_layout(barmode='stack', showlegend=False, width=width, height=800)
    fig.show()
In [31]:
# function that plots the actual price against the predicted price
def plot_predict(y_test, y_predict):
  fig = px.strip(pd.DataFrame({'Actual price': y_test, 'Predicted price': y_predict}), x='Actual price', y='Predicted price', **layout)
  fig.show()

Linear (OLS)

In [32]:
# train the model
linear = linear_model.LinearRegression().fit(X_train, y_train)

# plot the largest coefficients
plot_coefs(features=features, coefs=linear.coef_)

# plot actual vs. predicted price
plot_predict(y_test, y_predict = linear.predict(X_test))

# save the results
# Note: scikit-learn's signature is (y_true, y_pred). MSE and MAE are symmetric
# in their arguments, but MAPE is not, so the MAPE here is relative to the prediction.
predict = linear.predict(X_test)
resultados.loc[len(resultados)] = {
    'Method': "Mínimos Quadrados",
    'MSE': mean_squared_error(predict, y_test),
    'MAE': mean_absolute_error(predict, y_test),
    'MAPE': mean_absolute_percentage_error(predict, y_test)
}
modelosTreinados.append({'Method': "Mínimos Quadrados", 'modelo': linear, 'RFE': False})

print('RESULTS: ')
display(resultados)
[Figure: bar charts of the negative and positive linear regression coefficients]
[Figure: strip plot of actual vs. predicted price]
RESULTS: 
              Method           MSE            MAE      MAPE
0  Mínimos Quadrados  4.086821e+10  137028.682193  0.340823
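
The metric calls in this notebook pass the prediction first and the true value second. scikit-learn documents the order as (y_true, y_pred); this makes no difference for MSE and MAE, but MAPE divides by its first argument. The toy numbers below are made up purely to show the asymmetry.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Made-up values, chosen so the two orders give different results.
y_true = np.array([100.0, 200.0])
y_pred = np.array([50.0, 250.0])

# Documented order: errors are relative to the true values.
mape_true_first = mean_absolute_percentage_error(y_true, y_pred)

# Swapped order: errors are relative to the predictions.
mape_pred_first = mean_absolute_percentage_error(y_pred, y_true)

print(mape_true_first, mape_pred_first)
```

With the documented order the errors are 50/100 and 50/200; with the swapped order they are 50/50 and 50/250, so the two averages differ.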

Linear (RFECV)

In [33]:
from sklearn.feature_selection import RFECV
linearRFE = linear_model.LinearRegression()
linear_rfe = RFECV(
    linearRFE, min_features_to_select=3,
    step=0.3, verbose=10, cv=2
    ).fit(X_train,y_train)
print('Selected features: ', features[linear_rfe.support_])
Fitting estimator with 10 features.
Fitting estimator with 7 features.
Fitting estimator with 4 features.
Fitting estimator with 10 features.
Fitting estimator with 7 features.
Fitting estimator with 4 features.
Selected features:  ['selector__bedrooms' 'selector__bathrooms' 'selector__sqft_above'
 'selector__sqft_basement' 'selector__floors' 'selector__waterfront'
 'selector__view' 'selector__condition' 'selector__grade' 'selector__year']
In [34]:
# train the model on the features kept by RFECV
treino = X_train[features[linear_rfe.support_]]
teste = X_test[features[linear_rfe.support_]]
linearRFE = linear_model.LinearRegression().fit(treino,y_train)

# plot the largest coefficients
plot_coefs(
    features=[i for (i, v) in zip(features, linear_rfe.support_) if v],
    coefs=linearRFE.coef_)

# plot actual vs. predicted price
predict = linearRFE.predict(teste)
plot_predict(y_test, y_predict=predict)

# save the results
resultados.loc[len(resultados)] = {
    'Method': "Mínimos Quadrados RFE",
    'MSE': mean_squared_error(predict, y_test),
    'MAE': mean_absolute_error(predict, y_test),
    'MAPE': mean_absolute_percentage_error(predict, y_test)
}
modelosTreinados.append({'Method': "Mínimos Quadrados RFE", 'modelo': linearRFE, 'RFE': True})

print('RESULTS: ')
display(resultados)
[Figure: bar charts of the negative and positive coefficients of the RFE linear model]
[Figure: strip plot of actual vs. predicted price]
RESULTS: 
                  Method           MSE            MAE      MAPE
0      Mínimos Quadrados  4.086821e+10  137028.682193  0.340823
1  Mínimos Quadrados RFE  4.086821e+10  137028.682193  0.340823
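
The two rows are identical because RFECV kept all 10 features, so the second model is fit on exactly the same inputs as the first. A small guard for that situation could look like the sketch below, which runs RFECV on synthetic data from make_regression; the variable names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Synthetic regression problem standing in for the housing features.
X, y = make_regression(n_samples=200, n_features=6, n_informative=6,
                       noise=1.0, random_state=0)

selector = RFECV(LinearRegression(), min_features_to_select=3, step=1, cv=2).fit(X, y)

# If every feature survives, refitting on the "selected" subset changes nothing.
kept_everything = bool(selector.support_.all())
print(selector.n_features_, kept_everything)
```

When kept_everything is True, the refit can be skipped or reported as equivalent to the full model.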

Lasso

In [35]:
# train the model
lasso = linear_model.LassoCV(cv=5).fit(X_train,y_train)

# plot cross-validated MSE against the penalty
plt.figure()
plt.plot(lasso.alphas_, lasso.mse_path_.mean(axis=-1),linewidth=2)
plt.ylim(bottom=0)
plt.xlabel('Penalty (alpha)')
plt.ylabel('MSE')
plt.show()

# print the best alpha
print(f'Best Alpha: {lasso.alpha_}')
print(f'Number of coefficients equal to 0: {sum(lasso.coef_ == 0)}')

# plot the largest coefficients
plot_coefs(features=features, coefs=lasso.coef_)

# plot actual vs. predicted price
predict = lasso.predict(X_test)
plot_predict(y_test, y_predict=predict)

# save the results
resultados.loc[len(resultados)] = {
    'Method': "Lasso",
    'MSE': mean_squared_error(predict, y_test),
    'MAE': mean_absolute_error(predict, y_test),
    'MAPE': mean_absolute_percentage_error(predict, y_test)
}
modelosTreinados.append({'Method': "Lasso", 'modelo': lasso, 'RFE': False})

print('RESULTS: ')
display(resultados)
[Figure: line plot of mean cross-validated MSE vs. penalty (alpha)]
Best Alpha: 186344.29681685785
Number of coefficients equal to 0: 7
[Figure: bar charts of the negative and positive Lasso coefficients]
[Figure: strip plot of actual vs. predicted price]
RESULTS: 
                  Method           MSE            MAE      MAPE
0      Mínimos Quadrados  4.086821e+10  137028.682193  0.340823
1  Mínimos Quadrados RFE  4.086821e+10  137028.682193  0.340823
2                  Lasso  5.625945e+10  161634.607828  0.316445
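
The Lasso above is fit on unscaled features, so the L1 penalty weighs a coefficient on square feet and a coefficient on a 0/1 flag on very different scales; this may contribute to the large alpha and the seven zeroed coefficients. The sketch below is an assumption about a possible remedy, not part of the original analysis: it standardizes the inputs inside a pipeline before LassoCV, using synthetic data.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300

# Synthetic features on very different scales: an area and a binary flag.
area = rng.uniform(500, 4000, n)
flag = rng.integers(0, 2, n)
X = np.column_stack([area, flag])
y = 200 * area + 300_000 * flag + rng.normal(0, 10_000, n)

# Standardize, then fit Lasso with a cross-validated penalty.
scaled_lasso = make_pipeline(StandardScaler(), LassoCV(cv=5)).fit(X, y)
coefs = scaled_lasso.named_steps["lassocv"].coef_
print(coefs, scaled_lasso.score(X, y))
```

Because the scaler sits inside the pipeline, it is refit on each training fold, which keeps information from the validation folds out of the scaling step.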

KNN¶

In [36]:
# build the model
knn = KNeighborsRegressor()

# hyperparameter tuning: grid search over k with 5-fold cross-validation
param_grid = {'n_neighbors': np.arange(1, 100, 5)}
knn_gscv = GridSearchCV(knn, param_grid, cv=5,
                        scoring=make_scorer(mse, greater_is_better=False))
knn_gscv.fit(pd.DataFrame(X_train, columns=features), y_train)

print(f'Best Params: {knn_gscv.best_params_}')

# the scorer negates the MSE, so flip the sign back before plotting
px.line(x=param_grid['n_neighbors'], y=knn_gscv.cv_results_['mean_test_score']*-1, width=600).show()

# partial dependence plots
PartialDependenceDisplay.from_estimator(knn_gscv, pd.DataFrame(X_test, columns=features), ['selector__waterfront','selector__view', 'selector__condition'])
PartialDependenceDisplay.from_estimator(knn_gscv, pd.DataFrame(X_test, columns=features), ['selector__bedrooms' , 'selector__bathrooms',  'selector__sqft_above'])

# plot actual vs. predicted values
predict = knn_gscv.predict(pd.DataFrame(X_test, columns=features))
plot_predict(y_test, y_predict=predict)

# save results
resultados.loc[len(resultados)] = {
    'Method': "KNN",
    'MSE': mean_squared_error(predict, y_test),
    'MAE': mean_absolute_error(predict, y_test),
    'MAPE': mean_absolute_percentage_error(predict, y_test)
}
modelosTreinados.append({'Method': "KNN", 'modelo': knn_gscv, 'RFE': False})

print('RESULTADOS: ')
display(resultados)
Best Params: {'n_neighbors': 26}
RESULTADOS: 
   Method                 MSE           MAE            MAPE
0  Mínimos Quadrados      4.086821e+10  137028.682193  0.340823
1  Mínimos Quadrados RFE  4.086821e+10  137028.682193  0.340823
2  Lasso                  5.625945e+10  161634.607828  0.316445
3  KNN                    5.303844e+10  151427.693695  0.282410
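The KNN search above builds its scorer with `make_scorer(mse, greater_is_better=False)`, which negates the MSE so that a higher score is better. As a minimal sketch on synthetic data (an assumption for illustration, not the housing dataset), the built-in `"neg_mean_squared_error"` scoring string should give the same cross-validation scores:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (hypothetical stand-in for X_train / y_train).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
param_grid = {"n_neighbors": np.arange(1, 30, 5)}

# Same pattern as the notebook: a custom scorer that returns -MSE.
custom = GridSearchCV(
    KNeighborsRegressor(), param_grid, cv=5,
    scoring=make_scorer(mean_squared_error, greater_is_better=False),
).fit(X, y)

# Built-in equivalent.
builtin = GridSearchCV(
    KNeighborsRegressor(), param_grid, cv=5,
    scoring="neg_mean_squared_error",
).fit(X, y)

# Scores are negative MSE, so negate them to recover the MSE per candidate.
cv_mse = -custom.cv_results_["mean_test_score"]
print(custom.best_params_, builtin.best_params_)
print(cv_mse)
```

Either form works; the string is shorter and avoids having to remember the sign flip when reading `mean_test_score`.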

KNN RFE¶

In [37]:
# build the model
knn = KNeighborsRegressor()

# hyperparameter tuning, restricted to the columns selected by RFE
param_grid = {'n_neighbors': np.arange(1, 100, 5)}
knn_gscv_RFE = GridSearchCV(knn, param_grid, cv=5,
                        scoring=make_scorer(mse, greater_is_better=False))
selectedColumns = features[linear_rfe.support_]
knn_gscv_RFE.fit(pd.DataFrame(X_train[selectedColumns], columns=selectedColumns), y_train)

print(f'Best Params: {knn_gscv_RFE.best_params_}')

# the scorer negates the MSE, so flip the sign back before plotting
px.line(x=param_grid['n_neighbors'], y=knn_gscv_RFE.cv_results_['mean_test_score']*-1, width=600).show()

# plot actual vs. predicted values
y_predict = knn_gscv_RFE.predict(pd.DataFrame(X_test[selectedColumns], columns=selectedColumns))
plot_predict(y_test, y_predict=y_predict)

# save results
resultados.loc[len(resultados)] = {
    'Method': "KNN RFE",
    'MSE': mean_squared_error(y_predict, y_test),
    'MAE': mean_absolute_error(y_predict, y_test),
    'MAPE': mean_absolute_percentage_error(y_predict, y_test)
}
modelosTreinados.append({'Method': "KNN RFE", 'modelo': knn_gscv_RFE, 'RFE': True})

print('RESULTADOS: ')
display(resultados)
Best Params: {'n_neighbors': 26}
RESULTADOS: 
   Method                 MSE           MAE            MAPE
0  Mínimos Quadrados      4.086821e+10  137028.682193  0.340823
1  Mínimos Quadrados RFE  4.086821e+10  137028.682193  0.340823
2  Lasso                  5.625945e+10  161634.607828  0.316445
3  KNN                    5.303844e+10  151427.693695  0.282410
4  KNN RFE                5.303844e+10  151427.693695  0.282410

Decision Tree¶

In [38]:
# build the model
tree = DecisionTreeRegressor(random_state=0)

# hyperparameter tuning: grid search over the tree depth
param_grid = {'max_depth':np.arange(1, 20)}
tree_gscv = GridSearchCV(tree, param_grid, cv=5,scoring=make_scorer(mse,greater_is_better=False))
tree_gscv.fit(X_train, y_train)
print(tree_gscv.best_params_)

# the scorer negates the MSE, so flip the sign back before plotting
px.line(x=param_grid['max_depth'], y=tree_gscv.cv_results_['mean_test_score']*-1, width=600).show()

# feature importances of the best tree
plot_coefs(features=features, coefs=tree_gscv.best_estimator_.feature_importances_, split_by_sign=False)

# draw the fitted tree
fig, ax = plt.subplots(figsize=(15, 15))
plot_tree(tree_gscv.best_estimator_, fontsize=8, ax=ax, feature_names=features)
plt.show()

# save results
predict = tree_gscv.predict(X_test)
resultados.loc[len(resultados)] = {
    'Method': "Árvore de Decisão",
    'MSE': mean_squared_error(predict, y_test),
    'MAE': mean_absolute_error(predict, y_test),
    'MAPE': mean_absolute_percentage_error(predict, y_test)
}
modelosTreinados.append({'Method': "Árvore de Decisão", 'modelo': tree_gscv, 'RFE': False})

print('RESULTADOS: ')
display(resultados)
{'max_depth': 6}
RESULTADOS: 
   Method                 MSE           MAE            MAPE
0  Mínimos Quadrados      4.086821e+10  137028.682193  0.340823
1  Mínimos Quadrados RFE  4.086821e+10  137028.682193  0.340823
2  Lasso                  5.625945e+10  161634.607828  0.316445
3  KNN                    5.303844e+10  151427.693695  0.282410
4  KNN RFE                5.303844e+10  151427.693695  0.282410
5  Árvore de Decisão      4.251878e+10  137624.229449  0.261200

Random Forest¶

In [39]:
# build the model
floresta = RandomForestRegressor(random_state = 0)

# hyperparameter tuning: randomized search, 10 sampled combinations with 10-fold CV
param_grid = {'max_depth': [3, 7, 15], 'n_estimators': [100, 200, 300], 'max_samples': [.8, None], 'min_samples_split': [2, 5, 10]}
floresta_rscv = RandomizedSearchCV(floresta, param_grid, cv=10, scoring=make_scorer(mse,greater_is_better=False), n_iter=10)
floresta_rscv.fit(X_train, y_train)
print(floresta_rscv.best_params_)
{'n_estimators': 200, 'min_samples_split': 10, 'max_samples': None, 'max_depth': 15}
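The randomized search above draws `n_iter=10` combinations from the grid without fixing a seed, so a rerun may evaluate different candidates. The sketch below uses scikit-learn's `ParameterSampler` with the same grid to show how many combinations exist and that a fixed `random_state` makes the draw repeatable. `RandomizedSearchCV` accepts the same `random_state` argument; using it in the notebook would be a change to the original setup, not something the author did.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same grid as the random forest search above.
param_grid = {
    "max_depth": [3, 7, 15],
    "n_estimators": [100, 200, 300],
    "max_samples": [0.8, None],
    "min_samples_split": [2, 5, 10],
}

# Full grid size: 3 * 3 * 2 * 3 combinations.
n_total = len(ParameterGrid(param_grid))

# Two draws with the same seed return the same 10 candidates.
sample_a = list(ParameterSampler(param_grid, n_iter=10, random_state=0))
sample_b = list(ParameterSampler(param_grid, n_iter=10, random_state=0))

print(n_total)
print(sample_a == sample_b)
for candidate in sample_a[:3]:
    print(candidate)
```

Since only 10 of the 54 combinations are tried, the reported best parameters are the best among the sampled candidates, not necessarily the best of the whole grid.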
In [40]:
# refit on the full training set with the best parameters found
floresta.set_params(**floresta_rscv.best_params_)
floresta.fit(X_train,y_train)

# feature importances
plot_coefs(features=features, coefs=floresta.feature_importances_, split_by_sign=False)

# partial dependence plots
PartialDependenceDisplay.from_estimator(floresta, pd.DataFrame(X_test, columns=features), ['selector__waterfront','selector__view', 'selector__condition'])
PartialDependenceDisplay.from_estimator(floresta, pd.DataFrame(X_test, columns=features), ['selector__bedrooms' , 'selector__bathrooms',  'selector__sqft_above'])

# save results
predict = floresta.predict(X_test)
resultados.loc[len(resultados)] = {
    'Method': "Floresta",
    'MSE': mean_squared_error(predict, y_test),
    'MAE': mean_absolute_error(predict, y_test),
    'MAPE': mean_absolute_percentage_error(predict, y_test)
}
modelosTreinados.append({'Method': "Floresta", 'modelo': floresta, 'RFE': False})

print('RESULTADOS: ')
display(resultados)
RESULTADOS: 
   Method                 MSE           MAE            MAPE
0  Mínimos Quadrados      4.086821e+10  137028.682193  0.340823
1  Mínimos Quadrados RFE  4.086821e+10  137028.682193  0.340823
2  Lasso                  5.625945e+10  161634.607828  0.316445
3  KNN                    5.303844e+10  151427.693695  0.282410
4  KNN RFE                5.303844e+10  151427.693695  0.282410
5  Árvore de Decisão      4.251878e+10  137624.229449  0.261200
6  Floresta               3.294960e+10  121894.476730  0.233217

XGBoost¶

In [41]:
# build the model (training stops after 10 rounds without improvement on the last eval set)
xgb_model = xgb.XGBRegressor(early_stopping_rounds=10)

# hyperparameter tuning
parameters = {'max_depth': [3,7,12],
              'min_child_weight': [5,10,20],
              'subsample': [0.8,1],
              'colsample_bytree': [0.75,1],
              'n_estimators': [400],
              'eta':[0.01,0.1,0.5]}

X_train_1, X_val, y_train_1, y_val = train_test_split(X_train, y_train, test_size=1/10, random_state=7)
eval_set = [(X_train_1,y_train_1),(X_val, y_val)]

xgb_model_gscv = RandomizedSearchCV(xgb_model, parameters, n_jobs=4, cv=10,scoring=make_scorer(mse,greater_is_better=False), n_iter=10)
xgb_model_gscv.fit(
    X_train_1, y_train_1,
    eval_set=eval_set,
    verbose=0)


print(xgb_model_gscv.best_params_)
xgb_results = xgb_model_gscv.best_estimator_.evals_result()

# plot training and validation RMSE per boosting round
epochs = len(xgb_results['validation_0']['rmse'])
x_axis = range(0, epochs)
px.line(pd.concat([
    pd.DataFrame({'epoch': x_axis, 'RMSE': xgb_results['validation_0']['rmse']}).assign(dataset='Train'),
    pd.DataFrame({'epoch': x_axis, 'RMSE': xgb_results['validation_1']['rmse']}).assign(dataset='Validation'),
]), x='epoch', y='RMSE', color='dataset', width=650).show()

# plot the most important features
plot_coefs(features=features, coefs=xgb_model_gscv.best_estimator_.feature_importances_, split_by_sign=False)

# save results
predict = xgb_model_gscv.predict(X_test)
resultados.loc[len(resultados)] = {
    'Method': "XGBoost",
    'MSE': mean_squared_error(predict, y_test),
    'MAE': mean_absolute_error(predict, y_test),
    'MAPE': mean_absolute_percentage_error(predict, y_test)
}
modelosTreinados.append({'Method': "XGBoost", 'modelo': xgb_model_gscv, 'RFE': False})

print('RESULTADOS: ')
display(resultados)
{'subsample': 1, 'n_estimators': 400, 'min_child_weight': 10, 'max_depth': 7, 'eta': 0.01, 'colsample_bytree': 0.75}
RESULTADOS: 
   Method                 MSE           MAE            MAPE
0  Mínimos Quadrados      4.086821e+10  137028.682193  0.340823
1  Mínimos Quadrados RFE  4.086821e+10  137028.682193  0.340823
2  Lasso                  5.625945e+10  161634.607828  0.316445
3  KNN                    5.303844e+10  151427.693695  0.282410
4  KNN RFE                5.303844e+10  151427.693695  0.282410
5  Árvore de Decisão      4.251878e+10  137624.229449  0.261200
6  Floresta               3.294960e+10  121894.476730  0.233217
7  XGBoost                3.285581e+10  122489.895915  0.233375

NNets¶

In [42]:
batch_size = 200
epochs = 200
nNetsModel = keras.Sequential(
    [
        keras.Input(shape=(X_train.shape[1], )),
        layers.Dense(30, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(20, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(5, activation="relu"),
        layers.Dense(1, activation="linear"),
    ]
    )

nNetsModel.summary()

nNetsModel.compile(loss="mse", optimizer="adam", metrics=["mae"])

X_train_dense = X_train
X_test_dense  = X_test

history = nNetsModel.fit(
    X_train_dense, y_train,
    batch_size=batch_size,
    epochs=epochs, validation_split=0.1,
    shuffle=True,verbose=0
    )

# plot training and validation MSE per epoch
plt.figure()
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.ylabel('mse')
plt.xlabel('epoch')
# the second curve comes from validation_split, not from the test set
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

# save results
predict = nNetsModel.predict(X_test_dense)
mse_estimate = mse(predict, y_test)
resultados.loc[len(resultados)] = {
    'Method': "NNets",
    'MSE': mean_squared_error(predict, y_test),
    'MAE': mean_absolute_error(predict, y_test),
    'MAPE': mean_absolute_percentage_error(predict, y_test)
}
modelosTreinados.append({'Method': "NNets", 'modelo': nNetsModel, 'RFE': False})

print('RESULTADOS: ')
display(resultados)
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 30)             │           330 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout (Dropout)               │ (None, 30)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 20)             │           620 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_1 (Dropout)             │ (None, 20)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_2 (Dense)                 │ (None, 5)              │           105 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_3 (Dense)                 │ (None, 1)              │             6 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 1,061 (4.14 KB)
 Trainable params: 1,061 (4.14 KB)
 Non-trainable params: 0 (0.00 B)
68/68 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
RESULTADOS: 
   Method                 MSE           MAE            MAPE
0  Mínimos Quadrados      4.086821e+10  137028.682193  0.340823
1  Mínimos Quadrados RFE  4.086821e+10  137028.682193  0.340823
2  Lasso                  5.625945e+10  161634.607828  0.316445
3  KNN                    5.303844e+10  151427.693695  0.282410
4  KNN RFE                5.303844e+10  151427.693695  0.282410
5  Árvore de Decisão      4.251878e+10  137624.229449  0.261200
6  Floresta               3.294960e+10  121894.476730  0.233217
7  XGBoost                3.285581e+10  122489.895915  0.233375
8  NNets                  5.620638e+10  157116.843844  0.327772
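The parameter counts in the model summary above can be checked by hand: a `Dense` layer with bias has `(inputs + 1) * units` weights, and `Dropout` adds none. The sketch below assumes an input width of 10, which is inferred from the first layer's 330 parameters (11 × 30) rather than stated in the notebook.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights plus one bias per unit for a fully connected layer."""
    return (n_inputs + 1) * n_units

# Input width (assumed to be 10) followed by the Dense layer sizes from the model.
layer_sizes = [10, 30, 20, 5, 1]

counts = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)

print(counts)
print(total)
```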

Results¶

In [43]:
print('RESULTADOS: ')
display(resultados)
RESULTADOS: 
   Method                 MSE           MAE            MAPE
0  Mínimos Quadrados      4.086821e+10  137028.682193  0.340823
1  Mínimos Quadrados RFE  4.086821e+10  137028.682193  0.340823
2  Lasso                  5.625945e+10  161634.607828  0.316445
3  KNN                    5.303844e+10  151427.693695  0.282410
4  KNN RFE                5.303844e+10  151427.693695  0.282410
5  Árvore de Decisão      4.251878e+10  137624.229449  0.261200
6  Floresta               3.294960e+10  121894.476730  0.233217
7  XGBoost                3.285581e+10  122489.895915  0.233375
8  NNets                  5.620638e+10  157116.843844  0.327772
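To read the table above more easily, one option is to sort it and pick the best method per metric. The sketch below rebuilds the table from the reported numbers and adds an RMSE column, which is a convenience introduced here (not part of the notebook) so the squared-error metric is in the same units as MAE.

```python
import numpy as np
import pandas as pd

# Metrics copied from the results table above.
resultados = pd.DataFrame(
    [
        ("Mínimos Quadrados", 4.086821e10, 137028.682193, 0.340823),
        ("Mínimos Quadrados RFE", 4.086821e10, 137028.682193, 0.340823),
        ("Lasso", 5.625945e10, 161634.607828, 0.316445),
        ("KNN", 5.303844e10, 151427.693695, 0.282410),
        ("KNN RFE", 5.303844e10, 151427.693695, 0.282410),
        ("Árvore de Decisão", 4.251878e10, 137624.229449, 0.261200),
        ("Floresta", 3.294960e10, 121894.476730, 0.233217),
        ("XGBoost", 3.285581e10, 122489.895915, 0.233375),
        ("NNets", 5.620638e10, 157116.843844, 0.327772),
    ],
    columns=["Method", "MSE", "MAE", "MAPE"],
)

# RMSE is in the same units as the target, so it can be compared with MAE.
resultados["RMSE"] = np.sqrt(resultados["MSE"])

# Rank by MSE, the metric used for model selection during tuning.
ranking = resultados.sort_values("MSE").reset_index(drop=True)

# Best (lowest) method for each metric.
best_per_metric = {
    metric: resultados.loc[resultados[metric].idxmin(), "Method"]
    for metric in ["MSE", "MAE", "MAPE"]
}

print(ranking[["Method", "RMSE", "MAE", "MAPE"]])
print(best_per_metric)
```

XGBoost has the lowest MSE, while the random forest (Floresta) has the lowest MAE and MAPE. The margins between the two are small on all three metrics.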

Prediction for a new individual¶

In [44]:
# start from an existing test row, copied so the assignments below do not touch X_test
individoAvaliado = X_test[features].iloc[0].copy()
individoAvaliado['selector__bedrooms'] = 1.0
individoAvaliado['selector__bathrooms'] = 1.0
individoAvaliado['selector__sqft_above'] = 1500.0
individoAvaliado['selector__sqft_basement'] = 0.0
individoAvaliado['selector__floors'] = 2.0
individoAvaliado['selector__waterfront'] = 0.0
individoAvaliado['selector__view'] = 3.0
individoAvaliado['selector__condition'] = 4.0
individoAvaliado['selector__grade'] = 8.0
individoAvaliado['selector__year'] = 2010.0
# models expect a 2-D input, so turn the Series into a one-row DataFrame
individoAvaliado = individoAvaliado.to_frame().T

# predict the new individual with every trained model
for modelo in modelosTreinados:
  avaliacao = modelo['modelo'].predict(individoAvaliado)[0]
  # Keras returns a 2-D array, so unwrap the single value
  if isinstance(avaliacao, np.ndarray):
    avaliacao = avaliacao[0]
  print(modelo['Method'] + ": " + str(avaliacao))
Mínimos Quadrados: 569809.3822638495
Mínimos Quadrados RFE: 569809.3822638495
Lasso: 304542.8641839186
KNN: 451961.26923076925
KNN RFE: 451961.26923076925
Árvore de Decisão: 633329.8468085106
Floresta: 603676.9349586806
XGBoost: 546656.44
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 39ms/step
NNets: 372902.88